Term Deposit Bank Case - Ensemble Techniques.

by Rodrigo H Correa

This is the 3rd case presented for the University of Texas at Austin Post Graduate Program in Artificial Intelligence and Machine Learning.

As always, and as good practice, the first step is importing the necessary libraries.

In [34]:
# Step one: importing the necessary packages
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

About the Dataset

The dataset used is called bank-full. It comprises ~45k observations of client data along with the outcome of a campaign focused on increasing subscriptions to term deposits.

In [2]:
#Step two: import the csv file
df = pd.read_csv('bank-full.csv')

About the objective

The objective is to create a Machine Learning model, using Ensemble Techniques, to direct the bank's efforts toward the right clients and predict the success of campaigns for the aforementioned product.

Therefore, to find the best model and, more importantly, the best course of strategic action for the bank, a number of techniques will be applied to improve those chances.

In [9]:
#Acquiring a basic DataSet description
df.describe().T
Out[9]:
count mean std min 25% 50% 75% max
age 45211.0 40.936210 10.618762 18.0 33.0 39.0 48.0 95.0
balance 45211.0 1362.272058 3044.765829 -8019.0 72.0 448.0 1428.0 102127.0
day 45211.0 15.806419 8.322476 1.0 8.0 16.0 21.0 31.0
duration 45211.0 258.163080 257.527812 0.0 103.0 180.0 319.0 4918.0
campaign 45211.0 2.763841 3.098021 1.0 1.0 2.0 3.0 63.0
pdays 45211.0 40.197828 100.128746 -1.0 -1.0 -1.0 -1.0 871.0
previous 45211.0 0.580323 2.303441 0.0 0.0 0.0 0.0 275.0
In [116]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
age          45211 non-null int64
job          45211 non-null object
marital      45211 non-null object
education    45211 non-null object
default      45211 non-null object
balance      45211 non-null int64
housing      45211 non-null object
loan         45211 non-null object
contact      45211 non-null object
day          45211 non-null int64
month        45211 non-null object
duration     45211 non-null int64
campaign     45211 non-null int64
pdays        45211 non-null int64
previous     45211 non-null int64
poutcome     45211 non-null object
Target       45211 non-null object
dtypes: int64(7), object(10)
memory usage: 5.9+ MB
In [117]:
for feature in df.columns: # Loop through all columns in the dataframe
    if df[feature].dtype == 'object': # Only apply to columns holding categorical strings
        df[feature] = pd.Categorical(df[feature]) # Convert strings to the pandas Categorical dtype
df.head(10)
Out[117]:
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome Target
0 58 management married tertiary no 2143 yes no unknown 5 may 261 1 0 0 unknown no
1 44 technician single secondary no 29 yes no unknown 5 may 151 1 0 0 unknown no
2 33 entrepreneur married secondary no 2 yes yes unknown 5 may 76 1 0 0 unknown no
3 47 blue-collar married unknown no 1506 yes no unknown 5 may 92 1 0 0 unknown no
4 33 unknown single unknown no 1 no no unknown 5 may 198 1 0 0 unknown no
5 35 management married tertiary no 231 yes no unknown 5 may 139 1 0 0 unknown no
6 28 management single tertiary no 447 yes yes unknown 5 may 217 1 0 0 unknown no
7 42 entrepreneur divorced tertiary yes 2 yes no unknown 5 may 380 1 0 0 unknown no
8 58 retired married primary no 121 yes no unknown 5 may 50 1 0 0 unknown no
9 43 technician single secondary no 593 yes no unknown 5 may 55 1 0 0 unknown no

EDA Analysis Using Pandas Profiling

I like to use Pandas Profiling since it presents a comprehensive, full-scale analysis with little coding. From it, I will highlight the main discoveries.

In [118]:
#Using pandas profiling for a more detailed understanding of the variables.
profile = ProfileReport(df, title='Pandas Profiling Report', html={'style':{'full_width':True}})
profile.to_notebook_iframe()

Major findings from Pandas Profiling.

Despite not presenting missing values as such, there are a couple of strange things going on in the dataset.

The first is the variable pdays, with a very strong frequency at -1, which read literally would mean the client was contacted -1 days ago. That makes no sense as a duration; in this dataset, -1 is a placeholder indicating the client was not contacted in a previous campaign.

The second, a little more subtle but with bigger consequences, is poutcome, with a very high occurrence of "unknown" outcomes.

A prediction model that in the end answers "I don't know" seems to me pretty much a waste of effort.

These issues will therefore be addressed later on, perhaps even in separate models for comparison.

For now, I'll simply finish the EDA with a couple of highlights.
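
Before moving on, here is a quick check (my own aside, not from the profiling report itself) to quantify both quirks:

In [ ]:
#Hedged check: how frequent are pdays == -1 and the "unknown" poutcome?
print((df.pdays == -1).mean())      # share of clients with the -1 placeholder
print(df.poutcome.value_counts())   # dominance of "unknown" among outcomes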

In [119]:
#checking (highlighting) the presence of null values.
df.isnull().sum()
Out[119]:
age          0
job          0
marital      0
education    0
default      0
balance      0
housing      0
loan         0
contact      0
day          0
month        0
duration     0
campaign     0
pdays        0
previous     0
poutcome     0
Target       0
dtype: int64
In [120]:
# Just some configs for a better Seaborn experience.
sns.set("poster")
sns.set_style('whitegrid')
In [121]:
#Verifying the outcomes of the campaign based on the remaining factors.
sns.pairplot(df, hue = "poutcome")
Out[121]:
<seaborn.axisgrid.PairGrid at 0x26cc28b1288>
In [122]:
#Presenting the Correlation matrix for the variables.
plt.figure( figsize = (20,20))
corr = df.corr()
cmap = sns.diverging_palette(220, 10, as_cmap = True)
sns.heatmap(corr, cmap = cmap,  square = True, linewidth = .5, annot = True)
Out[122]:
<matplotlib.axes._subplots.AxesSubplot at 0x26cc4790288>

Final EDA Remarks

The pairplot and the correlation matrix did not reveal anything that stands out, mainly in terms of feature importance: no single feature, on its own, gives much insight into whether the response will be better or worse.
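
As a numeric cross-check (my own aside, not part of the original analysis), each numeric feature can be correlated against a temporarily encoded target to confirm that no single feature dominates:

In [ ]:
#Hedged sketch: correlation of numeric features with the target ('yes' = 1).
print(df.assign(y = (df.Target == 'yes').astype(int)).corr()['y'].sort_values())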

Dealing with the "funny" data.

As mentioned before, two major things caught my attention in the dataset.

1) Variable pdays with negative values: per the dataset documentation, -1 means the client was not contacted in a previous campaign, so no number of elapsed days can be inferred from it.

2) Variable poutcome with a lot (~36k) of registered "unknown" values. This basically means a campaign was conducted with a certain client and we have no idea what happened next, whether failure or success. My two cents: this is a great deal of noise to be taken care of. There is a high incidence of "other" as well, which may indicate, say, the client hiring a different service than the one offered. That is not the target either, but it is not necessarily bad.

From this point on, my first task is to determine the course of action:

pdays: convert -1 to zero, the next logical number.

poutcome: filter a little further and understand what kind of bias influences this "missing" value.

Even so, I will create new objects and, by the end, run the "dirty" model for comparison.

In [123]:
#Converting negative pdays value into zero.
df.pdays.replace({-1:0}, inplace = True)
plt.figure( figsize = (20,20))
plt.xlim(-5, 400)
sns.distplot(df.pdays)
Out[123]:
<matplotlib.axes._subplots.AxesSubplot at 0x26cc832c8c8>

Understanding missing (unknown) values in poutcome

First, it is important to understand the nature of the missing data, since it could be Missing Completely at Random (MCAR), Missing at Random (MAR), or Missing Not at Random (MNAR): source (https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4).

The key distinction is that under MCAR the missingness is unrelated to any variable, under MAR it depends only on other observed variables, and under MNAR it depends on the missing value itself.

The unknown values seem to be related to the time components of the campaign, so it appears we are facing a MAR problem.

Still, the original dataset is kept as df and will remain so to be tested.
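
To make the MAR hunch concrete, a quick probe (my own aside, with illustrative column choices) compares prior-campaign features between unknown and known outcomes:

In [ ]:
#Hedged check: "unknown" outcomes should coincide with clients lacking prior-campaign history.
print(df.assign(unknown = df.poutcome == 'unknown')
        .groupby('unknown')[['pdays', 'previous']].mean())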

In [124]:
#Creating a new Dataset, excluding the "unknown" outcomes, which are pure noise.
known_outcomes = df.poutcome != "unknown"
In [125]:
df2 = df[known_outcomes]
df2.describe().T
Out[125]:
count mean std min 25% 50% 75% max
age 8252.0 40.954556 11.424585 18.0 33.0 38.0 48.00 93.0
balance 8252.0 1557.323558 3061.334465 -1884.0 167.0 603.0 1743.75 81204.0
day 8252.0 14.287203 7.918667 1.0 7.0 14.0 20.00 31.0
duration 8252.0 260.065439 235.142495 1.0 113.0 193.0 324.00 2219.0
campaign 8252.0 2.055986 1.561340 1.0 1.0 2.0 2.00 16.0
pdays 8252.0 224.544353 115.300549 1.0 133.0 194.5 327.00 871.0
previous 8252.0 3.177412 4.561864 1.0 1.0 2.0 4.00 275.0
In [126]:
df.describe().T
Out[126]:
count mean std min 25% 50% 75% max
age 45211.0 40.936210 10.618762 18.0 33.0 39.0 48.0 95.0
balance 45211.0 1362.272058 3044.765829 -8019.0 72.0 448.0 1428.0 102127.0
day 45211.0 15.806419 8.322476 1.0 8.0 16.0 21.0 31.0
duration 45211.0 258.163080 257.527812 0.0 103.0 180.0 319.0 4918.0
campaign 45211.0 2.763841 3.098021 1.0 1.0 2.0 3.0 63.0
pdays 45211.0 41.015195 99.792615 0.0 0.0 0.0 0.0 871.0
previous 45211.0 0.580323 2.303441 0.0 0.0 0.0 0.0 275.0
In [127]:
#Presenting the Correlation matrix for the variables.
plt.figure( figsize = (20,20))
corr = df2.corr()
cmap = sns.diverging_palette(240, 1, as_cmap = True) # equivalent hues kept within the documented [0, 360) range
sns.heatmap(corr, cmap = cmap,  square = True, linewidth = .5, annot = True)
Out[127]:
<matplotlib.axes._subplots.AxesSubplot at 0x26cc983bdc8>
In [155]:
from sklearn.preprocessing import LabelEncoder

df3 = df2.copy()
lb_make = LabelEncoder()

# Encode every categorical column into integer labels.
for col in ['job', 'marital', 'education', 'month', 'poutcome',
            'contact', 'default', 'housing', 'loan', 'Target']:
    df3[col] = lb_make.fit_transform(df2[col])
df3.head()
Out[155]:
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome Target
24060 33 0 1 2 0 882 0 0 1 21 10 39 1 151 3 0 0
24062 42 0 2 1 0 -247 1 1 1 21 10 519 1 166 1 1 1
24064 33 7 1 1 0 3444 1 0 1 21 10 144 1 91 4 0 1
24072 36 4 1 2 0 2415 1 0 1 22 10 73 1 86 4 1 0
24077 36 4 1 2 0 0 1 0 1 23 10 140 1 143 3 0 1
In [128]:
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
In [184]:
all_feats = ['job', 'marital', 'education', 'default', 'housing', 
             'loan', 'contact', 'month', 'age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous', 'poutcome']
X = df3[all_feats]
y = df3.Target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
In [185]:
import statsmodels.api as sm
# Note: no intercept is added here (sm.add_constant is not used), so the logit is fit without a constant term.
logit_model = sm.Logit(y, X)
result = logit_model.fit() 
print(result.summary2())
Optimization terminated successfully.
         Current function value: 0.388426
         Iterations 7
                         Results: Logit
================================================================
Model:              Logit            Pseudo R-squared: 0.281    
Dependent Variable: Target           AIC:              6442.5896
Date:               2020-03-12 20:45 BIC:              6554.8810
No. Observations:   8252             Log-Likelihood:   -3205.3  
Df Model:           15               LL-Null:          -4456.2  
Df Residuals:       8236             LLR p-value:      0.0000   
Converged:          1.0000           Scale:            1.0000   
No. Iterations:     7.0000                                      
-----------------------------------------------------------------
             Coef.   Std.Err.     z      P>|z|    [0.025   0.975]
-----------------------------------------------------------------
job         -0.0074    0.0099   -0.7510  0.4527  -0.0268   0.0119
marital     -0.2800    0.0484   -5.7873  0.0000  -0.3748  -0.1852
education    0.0466    0.0422    1.1043  0.2695  -0.0361   0.1294
default     -0.9776    0.5458   -1.7911  0.0733  -2.0474   0.0922
housing     -1.3957    0.0679  -20.5641  0.0000  -1.5287  -1.2626
loan        -0.6713    0.1137   -5.9022  0.0000  -0.8942  -0.4484
contact     -0.1243    0.1064   -1.1684  0.2427  -0.3329   0.0842
month       -0.0189    0.0087   -2.1799  0.0293  -0.0359  -0.0019
age         -0.0236    0.0021  -11.1355  0.0000  -0.0278  -0.0195
balance      0.0000    0.0000    1.4580  0.1449  -0.0000   0.0000
day         -0.0076    0.0037   -2.0505  0.0403  -0.0150  -0.0003
duration     0.0033    0.0001   23.7996  0.0000   0.0030   0.0035
campaign    -0.2129    0.0254   -8.3850  0.0000  -0.2626  -0.1631
pdays       -0.0008    0.0003   -2.9169  0.0035  -0.0014  -0.0003
previous     0.0015    0.0061    0.2430  0.8080  -0.0105   0.0135
poutcome     0.9811    0.0385   25.4920  0.0000   0.9057   1.0565
================================================================
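
Raw logit coefficients are hard to read directly; as a hedged aside (not in the original run), exponentiating them yields odds ratios, e.g. each level of poutcome multiplies the odds of success by roughly exp(0.98) ≈ 2.7:

In [ ]:
#Converting logit coefficients into odds ratios for easier interpretation.
print(np.exp(result.params).sort_values(ascending = False))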

In [189]:
from sklearn.tree import DecisionTreeClassifier

dTree = DecisionTreeClassifier(criterion = 'gini', random_state=1)
dTree.fit(X_train, y_train)
Out[189]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=1, splitter='best')
In [190]:
y_pred = dTree.predict(X_test)
In [191]:
print(dTree.score(X_train, y_train))
print(dTree.score(X_test, y_test))
1.0
0.7928109854604201
In [192]:
dTreeR = DecisionTreeClassifier(criterion = 'gini', max_depth = 6, random_state=1)
dTreeR.fit(X_train, y_train)
print(dTreeR.score(X_train, y_train))
print(dTreeR.score(X_test, y_test))
0.8587257617728532
0.8404684975767367
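
The unpruned tree memorizes the training set (accuracy 1.0) while the depth-6 tree generalizes better. Rather than hand-picking max_depth = 6, a cross-validated search is one possible refinement (a sketch of mine, with illustrative parameter ranges):

In [ ]:
#Hedged sketch: cross-validating the tree depth instead of fixing it.
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(DecisionTreeClassifier(random_state = 1),
                    param_grid = {'max_depth': list(range(2, 11))},
                    cv = 5, scoring = 'accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)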
In [225]:
print (pd.DataFrame(dTreeR.feature_importances_, columns = ["Importance"], index = X_train.columns))
           Importance
job          0.007228
marital      0.004279
education    0.000000
default      0.000000
housing      0.066985
loan         0.006125
contact      0.000000
month        0.016961
age          0.011837
balance      0.013278
day          0.016415
duration     0.307550
campaign     0.001634
pdays        0.071327
previous     0.004681
poutcome     0.471699
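
poutcome and duration clearly carry most of the signal. Sorting the same table (a presentation tweak of mine) makes that immediate:

In [ ]:
#Same importances, sorted for readability.
imp = pd.Series(dTreeR.feature_importances_, index = X_train.columns)
print(imp.sort_values(ascending = False))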
In [197]:
from sklearn import metrics
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.88      0.85      0.87      1935
           1       0.52      0.58      0.55       541

    accuracy                           0.79      2476
   macro avg       0.70      0.72      0.71      2476
weighted avg       0.80      0.79      0.80      2476

In [224]:
#Plotting the ROC curve for the (unpruned) dTree model.
plt.figure( figsize = (20,10))
y_pred_proba = dTree.predict_proba(X_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.title('AUC dTree')
plt.plot(fpr, tpr, label = "dTree, auc = " + str(auc))
plt.legend(loc=4)
plt.show()
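
For comparison (my own addition, not in the original run), the same computation for the pruned tree:

In [ ]:
#AUC of the regularized (max_depth = 6) tree for comparison.
y_pred_probaR = dTreeR.predict_proba(X_test)[:, 1]
print(metrics.roc_auc_score(y_test, y_pred_probaR))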
In [227]:
print(dTreeR.score(X_test , y_test))
y_predict = dTreeR.predict(X_test)

cm=metrics.confusion_matrix(y_test, y_predict, labels=[0, 1])

df_cm = pd.DataFrame(cm, index = ["No","Yes"],
                  columns = ["No","Yes"])
plt.figure(figsize = (10,10))
sns.heatmap(df_cm,cmap = cmap, annot=True ,fmt='g')
0.8404684975767367
Out[227]:
<matplotlib.axes._subplots.AxesSubplot at 0x26cc2ad8088>
In [207]:
from sklearn.ensemble import BaggingClassifier

bgcl = BaggingClassifier(base_estimator=dTree, n_estimators=50,random_state=1)
#bgcl = BaggingClassifier(n_estimators=50,random_state=1)

bgcl = bgcl.fit(X_train, y_train)
In [228]:
y_predict = bgcl.predict(X_test)

print(bgcl.score(X_test , y_test))

cm=metrics.confusion_matrix(y_test, y_predict,labels=[0, 1])

df_cm = pd.DataFrame(cm, index = ["No","Yes"],
                  columns = ["No","Yes"])
plt.figure(figsize = (10,10))
sns.heatmap(df_cm, cmap = cmap, annot=True ,fmt='g')
0.8469305331179321
Out[228]:
<matplotlib.axes._subplots.AxesSubplot at 0x26cc00b87c8>
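
A side benefit of bagging (an optional check of mine, relying on the default bootstrap = True): the out-of-bag samples provide a generalization estimate without touching the test set:

In [ ]:
#Hedged sketch: out-of-bag accuracy estimate for the bagged trees.
bgcl_oob = BaggingClassifier(base_estimator = dTree, n_estimators = 50,
                             oob_score = True, random_state = 1)
bgcl_oob = bgcl_oob.fit(X_train, y_train)
print(bgcl_oob.oob_score_)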
In [210]:
from sklearn.ensemble import AdaBoostClassifier
abcl = AdaBoostClassifier(n_estimators=10, random_state=1)
#abcl = AdaBoostClassifier( n_estimators=50,random_state=1)
abcl = abcl.fit(X_train, y_train)
In [229]:
y_predict = abcl.predict(X_test)
print(abcl.score(X_test , y_test))

cm=metrics.confusion_matrix(y_test, y_predict,labels=[0, 1])

df_cm = pd.DataFrame(cm, index = ["No","Yes"],
                  columns = ["No","Yes"])
plt.figure(figsize = (10,10))
sns.heatmap(df_cm, cmap = cmap, annot=True ,fmt='g')
0.8558158319870759
Out[229]:
<matplotlib.axes._subplots.AxesSubplot at 0x26cc34b40c8>
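
The run above uses only 10 boosting rounds; a quick sweep (illustrative values, my own aside) shows how test accuracy responds to n_estimators:

In [ ]:
#Hedged sketch: sensitivity of AdaBoost to the number of estimators.
for n in (10, 25, 50, 100):
    clf = AdaBoostClassifier(n_estimators = n, random_state = 1).fit(X_train, y_train)
    print(n, clf.score(X_test, y_test))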
In [212]:
from sklearn.ensemble import GradientBoostingClassifier
gbcl = GradientBoostingClassifier(n_estimators = 50,random_state=1)
gbcl = gbcl.fit(X_train, y_train)
In [230]:
y_predict = gbcl.predict(X_test)
print(gbcl.score(X_test, y_test))
cm=metrics.confusion_matrix(y_test, y_predict,labels=[0, 1])

df_cm = pd.DataFrame(cm, index = ["No","Yes"],
                  columns = ["No","Yes"])
plt.figure(figsize = (10,10))
sns.heatmap(df_cm, cmap = cmap,  annot=True ,fmt='g')
0.8667205169628432
Out[230]:
<matplotlib.axes._subplots.AxesSubplot at 0x26cc27093c8>
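
Gradient boosting also exposes stage-by-stage predictions, which makes for a cheap learning-curve diagnostic (my own aside; staged_predict is a standard GradientBoostingClassifier method):

In [ ]:
#Hedged sketch: test accuracy after every 10th boosting stage.
for i, y_stage in enumerate(gbcl.staged_predict(X_test), start = 1):
    if i % 10 == 0:
        print(i, metrics.accuracy_score(y_test, y_stage))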
In [214]:
from sklearn.ensemble import RandomForestClassifier
rfcl = RandomForestClassifier(n_estimators = 50, random_state=1,max_features=12)
rfcl = rfcl.fit(X_train, y_train)
In [231]:
y_predict = rfcl.predict(X_test)
print(rfcl.score(X_test, y_test))
cm=metrics.confusion_matrix(y_test, y_predict,labels=[0, 1])

df_cm = pd.DataFrame(cm, index = ["No","Yes"],
                  columns = ["No","Yes"])
plt.figure(figsize = (10,10))
sns.heatmap(df_cm,cmap = cmap, annot=True ,fmt='g')
0.8501615508885298
Out[231]:
<matplotlib.axes._subplots.AxesSubplot at 0x26cc27d9b88>
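
To close, a compact recap (assembled here for convenience; the individual scores were printed above):

In [ ]:
#Side-by-side test accuracy of all fitted models.
for name, model in [('pruned tree', dTreeR), ('bagging', bgcl),
                    ('adaboost', abcl), ('gradient boosting', gbcl),
                    ('random forest', rfcl)]:
    print(name, model.score(X_test, y_test))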